Machine Learning on Amazon Retail Data
  • Code
  • By Bhavana
  1. Data Prep / EDA
  • Home
  • Introduction
  • Data Prep / EDA
  • Models and Methods
    • ARM (Association Rule Mining)
    • Naive Bayes
    • Clustering
    • Decision Trees
    • Neural Networks
    • Regression
    • SVM (Support Vector Machine)
  • Conclusions

On this page

  • Data Collection
  • Data Cleaning
  • Data Preprocessing / Visualization
    • Investigation of products in both sets of data

Data Prep / EDA

Where the data source, processing, and visualization (EDA) is presented.

Data Collection

Amazon product information was scraped from the website using the API service ScraperAPI; this is because, as Amazon is a hugely popular website, they have many anti-scraping measures in place such as rate-limiting, IP blocking, dymamic loading, and such. Using the external API service, these limitations were able to be avoided. The search queries chosen to search for items were based on top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.

import requests

payload = {
   'api_key': 'API_KEY',
   'query': 'iphone 15 charger',
   's': 'price-asc-rank'
}

response = requests.get('https://api.scraperapi.com/structured/amazon/search',
                        params=payload).json()

The jupyter notebook code for the web scraping can be found here.

Additionally, more data was used to supplement the existing data. Since the scraped data was only about 26K rows, a Kaggle dataset was used that contains more than one million rows, had around the same fields as the scraped data, and was also from the USA (many Amazon Kaggle datasets were from the non-US).

The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.

Table 1: The raw data from both datasets.
(a) The raw data scraped from Amazon using ScraperAPI
type position asin name image has_prime is_best_seller is_amazon_choice is_limited_deal stars total_reviews url availability_quantity spec price_string price_symbol price original_price section_name
0 search_product 15 B08KF9J2NW ENDGAME GEAR MPX390 Non-Slip Gaming Mouse Pad;... https://m.media-amazon.com/images/I/811RrKRBw1... False False False False 4.6 8.0 https://www.amazon.com/ENDGAME-GEAR-Combinatio... NaN {} NaN NaN NaN NaN NaN
1 search_product 26 B003SIOZYI Duracell Coppertop AAA Battery, 8 Count https://m.media-amazon.com/images/I/71j7GhQom9... False False False False 4.8 561.0 https://www.amazon.com/Duracell-Coppertop-AAA-... NaN {} NaN NaN NaN NaN NaN
2 search_product 36 B0BBMHDF5Z Kensington Duo Gel Mouse, Keyboard Wrist Rest ... https://m.media-amazon.com/images/I/41TCRgokr5... False False False False NaN NaN https://www.amazon.com/Kensington-Mouse-Keyboa... NaN {} $59.63 $ 59.63 NaN NaN
3 search_product 29 B0CV9CY4FG Zerodeko mini small air conditioner Air condit... https://m.media-amazon.com/images/I/71JvJWamsc... False False False False NaN NaN https://www.amazon.com/dp/B0CV9CY4FG/ref=sr_1_... NaN {} $27.19 $ 27.19 NaN NaN
4 search_product 43 B0927H84JT Long Lost Family - Season 5 https://m.media-amazon.com/images/I/71c2Z-JD6t... False False False False 5.0 1.0 https://www.amazon.com/Baby-Shoe-Box/dp/B091Y6... NaN {} NaN NaN NaN NaN NaN
(b) The raw data gotten from Kaggle
asin title imgUrl productURL stars reviews price listPrice category_id isBestSeller boughtInLastMonth
0 B0CCVNS1ZP 5 Pairs Mini Finger Shoes, Cool Mini Skateboar... https://m.media-amazon.com/images/I/51vvpDYs-W... https://www.amazon.com/dp/B0CCVNS1ZP 0.0 0 12.39 0.0 232 False 0
1 B09GWDRDCF MKS eco Oil - Hair Styling Elixir - Moroccan A... https://m.media-amazon.com/images/I/6109b0phmz... https://www.amazon.com/dp/B09GWDRDCF 4.8 0 23.99 0.0 47 False 500
2 B0C7PJSQSM VR Charging Station, USB C Cable Magnetic Char... https://m.media-amazon.com/images/I/41-Wtg0AYc... https://www.amazon.com/dp/B0C7PJSQSM 0.0 0 0.00 0.0 259 False 0
3 B0C3LSLN7N ReHisk 32 Inch 4K UHD (3840 x 2160) IPS Comput... https://m.media-amazon.com/images/I/81WD20RbcB... https://www.amazon.com/dp/B0C3LSLN7N 0.0 0 309.99 0.0 56 False 0
4 B00NGWZFIG Plasticplace Trash Bags │ 1.5 Mil │ Black Heav... https://m.media-amazon.com/images/I/51Jyj6iu2Z... https://www.amazon.com/dp/B00NGWZFIG 4.5 0 64.29 0.0 167 False 0

Data Cleaning

The datasets were cleaned seperately, then concatenated, then some final steps were taken to clean it.

The steps to clean the web-scaped data were:

  • Add date_scraped column
  • Remove unecessary columns: type, position, has_prime, is_amazon_choice, is_limited_deal, availability_quantity, spec, price_string, price_symbol, section_name
  • Expand and fix original_price
  • Rename columns to match standard snake case for merging both datasets
  • Drop rows with no asin or name or price
  • Drop rows with price of 0.0, since that doesn’t make sense
  • Fill NaN reviews column with 0

The steps to clean the Kaggle data were:

  • Add date_scraped column
  • Drop rows with any NaNs
  • Fix list_price of $0 to be instead equal to price
  • Change category_id to actual category by using category table
  • Drop rows with price of $0, since that doesn’t make sense
  • Rename columns to match standard snake case for merging both datasets

And then, after they were concatenated, the steps to clean were:

  • Remove duplicates (by asin + date scraped)
  • Rename columns

The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):

Table 2: The final unioned, cleaned, and processed data.
Asin Name Image Url Is Best Seller Stars Reviews Url Price Date Scraped List Price Bought In Month Category
0 B08DF85GXW Mens Guayabera Shirts Linen Short Sleeve Butto... https://m.media-amazon.com/images/I/51D6lya6mb... False 4.1 0.0 https://www.amazon.com/dp/B08DF85GXW 21.98 2023-11-01 29.99 0.0 Men's Clothing
1 B0BGY1F6YG Women's Casual Long Sleeve Bodycon Sweater Dre... https://m.media-amazon.com/images/I/71KkOkd1IQ... False 4.6 0.0 https://www.amazon.com/dp/B0BGY1F6YG 39.99 2023-11-01 49.99 0.0 Women's Clothing
2 B00CE04GVS Magformers Inspire (14-pieces)Set Magnetic Bui... https://m.media-amazon.com/images/I/71Ez54D6gU... False 4.7 0.0 https://www.amazon.com/dp/B00CE04GVS 24.99 2023-11-01 24.99 0.0 Building Toys
3 B08YNGR4T7 Handrails for Outdoor Steps,3 Step Handrail Fi... https://m.media-amazon.com/images/I/71x-ZmTQGW... False 4.4 957.0 https://www.amazon.com/dp/B08YNGR4T7 99.99 2023-11-01 99.99 0.0 Industrial Hardware
4 B09WQ7M8DF CUPKIN Stackable Stainless Steel Kids Cup Smoo... https://m.media-amazon.com/images/I/51yzzIGJkH... False 4.1 0.0 https://www.amazon.com/dp/B09WQ7M8DF 14.99 2023-11-01 14.99 0.0 Baby & Toddler Feeding Supplies

The code for the data cleaning can be found here.

Data Preprocessing / Visualization

Various types of EDA were performed in order to examine the data; as a note, most visuals are interactive (zoomable, pannable, etc). The code for all visualizations can be found here.

Important

If the interactive figures don’t load, dont worry: just turn off all ad-blockers/privacy browsing, make sure you are using Chrome/Firefox, and refresh the page until all figures load.

Figure 1

Figure 1: A histogram of all categories of all Amazon products. Note scraped data did not have categories, but the Kaggle data did.

Figure 2

Figure 2: Histogram of the number of stars recieved for all products.

Figure 3

Violin plot of the number of a product bought in a month

Violin plot of price of a product

Violin plot of the number of reviews
Figure 3: Violin plots (with overlayed box plot and outliers) of 3 of the main numerical columns.

Figure 4

Figure 4: Stars vs number of reviews recieved by an amazon product, colored by whether the product was a best-seller.

Figure 5

Wordcloud for categories of products

Wordcloud for the names of products
Figure 5: Wordclouds (where more frequent appearing words are bigger) of the categories of products and the names of products.

Figure 6

Figure 6: The number of stars a product recieved, vs the price in log scale, colored by whether the product was a best seller.

Investigation of products in both sets of data

The following plots focused on products that had the same Asin (identifying ID), that were in both the data personally scraped in 2024 and the Kaggle data from 2023. Investigating this subset of data could reveal a lot about how product prices, reviews, and other factors changed over time, as the only way to get time difference data was to use products that actually had more than one timepoint.

Figure 7

Figure 7: Price vs list price of items with the same ASIN across dates scraped, with a trendline.

Given that we can see outliers in price affecting the plot of the graph, it was decided for analysis to only consider those prices most populous, aka prices less than $800.

Figure 8

Figure 8: Histogram of prices, colored by whether the change in price increased or decreased over time, for those items that were in both sets of data.

Figure 9

Figure 9: Price vs the difference in price, over the two sets of data, colored by whether the price diff increased or decreased.

Only categories with more than 20 products were included, to reduce noise and to focus on meaningful categories. Figure 10

Figure 10: For the included categories, a bar chat showing the ratio of products that had their price increase over time vs decrease.
Introduction
ARM (Association Rule Mining)